                                    

                                    

             Proposal for Speech Understanding Research





                                    

                                    

              Arthur Samuel, Senior Research Associate

                        Principal Investigator

                                    

                                    

                                    

                                    

                                    

                           Submitted to the

                  Advanced Research Projects Agency



                              July 1973







                                    

                                    

                                    

                     Computer Science Department

                  School of Humanities and Sciences

                         Stanford University

                               ABSTRACT





A two-year research effort, beginning in October 1973, is proposed on

the  use  of automatic training methods to adapt speech understanding

systems to the characteristics of the speaker. This would  involve  a

small  staff of people with special capabilities in this field and it

would require a budget  of  $226,266.  No  new  facilities  would  be

required.



                          TABLE OF CONTENTS

 Section

    1.  Proposal

    2.  Facilities

    3.  Budget

 Appendix

    A.  Initial Form of Signature Table for Speech Recognition

    B.  Speech Research at Stanford University

    C.  Bibliography

    D.  Cognizant Personnel

1.  A PROPOSAL FOR SPEECH UNDERSTANDING RESEARCH





	It is proposed that the work on speech  recognition  that  is

now under way in the A.I. project at Stanford University be continued

and  extended  with  broadened  aims   in   the   field   of   speech

understanding.   This work gives considerable promise both of solving

some of  the  immediate  problems  that  beset  speech  understanding

research and of providing a basis for future advances.



	It is further proposed that this work be more closely tied to

the ARPA Speech Understanding Research effort than it has been in the

past and that it have as its express aim the study and application to

speech recognition of a machine learning process that has proved

highly  successful  in  another application and that has already been

tested out to a limited extent in speech  recognition.   The  machine

learning  process  offers  both  an automatic training scheme and the

inherent ability of the system  to  adapt  to  various  speakers  and

dialects. Speech recognition via machine learning represents a global

approach to the speech recognition problem and  can  be  incorporated

into a wide class of limited vocabulary systems.



	Finally we would propose accepting responsibility for keeping

other ARPA projects supplied with  operating  versions  of  the  best

current programs that we have developed. The availability of the high

quality front end that the signature table  approach  provides  would


enable designers of the various over-all systems to test the relative

performance of the top-down portions of their systems without  having

to  make allowances for the deficiencies of their currently available

front ends. Indeed, if the signature table scheme can be made  simple

enough to compete on a time basis (and we believe that it can), then

it may replace the other front end  schemes  that  are  currently  in

favor.



	Stanford University is well suited as the site for such work,

having both the facilities for this work and a staff of  people  with

experience  and  interest in machine learning, phonetic analysis, and

digital signal processing. The  staff  at  present  consists  of  the

proposed Principal Investigator, Arthur L. Samuel, and Dr. Neil Miller,

who has had considerable experience in the analysis and synthesis  of

human  voice  signals using digital processes. An additional research

associate together with a few graduate students  would  complete  the

team.  It is anticipated that this staff of not more than 3 full-time

members with the help of 2 or  3  graduate  students  could  mount  a

meaningful program, which should be funded for a minimum of two years

to ensure continuity of effort. We would expect  to  demonstrate  the

utility  of the Signature Table approach within this time span and to

provide a working system that could be used as the front end for  any

of   the  speech  understanding  systems  that  are  currently  under

development or are being planned.

	Ultimately  we  would  like  to  have  a  system  capable  of

understanding  speech  from an unlimited domain of discourse and with

an unknown speaker. It seems not unreasonable to expect the system to

deal with this situation very much as people do when they adapt their

understanding processes to the speaker's idiosyncrasies during the

conversation.   The   signature   table   method   gives  promise  of

contributing toward the solution of this problem as well as  being  a

possible answer to some of the more immediate problems.



	The  initial  thrust of the proposed work would be toward the

development of adaptive  learning  techniques,  using  the  signature

table  method  and  some  more recent variants and extensions of this

basic procedure. We have already demonstrated the usefulness of  this

method  for  the  initial  assignment  of significant features to the

acoustic signals. One of the next steps will be to extend the  method

to include acoustic-phonetic probabilities in the decision process.



	Still another  aspect to be  studied would  be the amount  of

preprocessing that should be done and the desired balance between

bottom-up  and top-down  approaches.    It  is  fairly  obvious  that

decisions of this  sort should ideally be  made dynamically depending

upon  the familiarity of the system with  the domain of discourse and

with  the   characteristics   of   the  speaker.   Compromises   will

undoubtedly have to be  made in any immediately realizable system but

we should understand  better than we  now do the  limitations on  the

system that such compromises impose.


	It  may  be  well  at  this  point  to  describe  the general

philosophy that has been followed in the work that is currently under

way  and  the  results  that have been achieved to date. We have been

studying  elements  of  a  speech  recognition  system  that are  not

dependent upon the use of a limited vocabulary and that can recognize

continuous speech by a number of different speakers.



	Such a system should be able to function successfully  either

without any previous training for the specific speaker in question or

after a short training session in which the speaker would be asked to

repeat certain phrases designed to train the system on those phonetic

utterances that seemed to depart from the previously learned norm. In

either  case, it  is  believed  that some automatic or semi-automatic

training system should be employed to acquire the data that  is  used

for  the identification of the phonetic information in the speech. We

believe that this can best be done by employing a modification of the

signature  table  scheme previously described. A brief review of this

earlier form of signature table is given in reference 17.



	The over-all system  is envisioned as  one in which  the more

or  less conventional method is  used of separating  the input speech

into short time slices for which some sort of frequency analysis

(homomorphic, linear predictive coding, or the like) is done.  We

then  interpret this information in terms  of significant features by

means of a set of  signature tables.  At this point we  define longer

sections  of  the  speech  called  segments  which  are  obtained  by


grouping together  varying  numbers of  the  original slices  on  the

basis of their similarity.  This then takes the place  of other forms

of initial  segmentation.  Having identified a  series of segments in

this way  we next  use another  set of  signature  tables to  extract

information  from the  sequence of  segments  and combine  it with  a

limited  amount of  syntactic  and semantic  information to  define a

sequence of phonemes.
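
	To make the grouping step concrete, the following sketch (in
present-day Python, purely for illustration) groups adjacent time
slices into segments whenever successive feature vectors remain
within a similarity threshold.  The feature representation and the
distance test are assumptions of the sketch, not commitments of this
proposal; the actual similarity measure is one of the design choices
to be determined by experiment.

    def group_slices(slices, threshold=1.5):
        """Group adjacent feature vectors into segments by similarity.

        slices    -- one feature vector per time slice
        threshold -- hypothetical distance bound for two slices to
                     share a segment
        """
        if not slices:
            return []

        def distance(a, b):
            return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

        segments = []
        current = [slices[0]]
        for s in slices[1:]:
            if distance(current[-1], s) <= threshold:
                current.append(s)         # similar: extend the segment
            else:
                segments.append(current)  # dissimilar: start a new one
                current = [s]
        segments.append(current)
        return segments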



	While  it would be possible to extend this bottom up approach

still further, it seems reasonable to break off  at  this  point  and

revert  to  a  top down approach from here on. The real difference in

the overall system would then be that the  top  down  analysis  would

deal  with  the  outputs  from  the  signature  table  section as its

primitives rather than with the outputs from the initial measurements

either  in the time domain or in the frequency domain. In the case of

inconsistencies the system could either refer to the  second  choices

retained  within  the  signature tables or if need be could always go

clear back to the input parameters. The decision as  to  how  far  to

carry  the  initial  bottom up analysis must depend upon the relative

cost of this analysis both in complexity and processing time and  the

certainty  with  which it can be performed as compared with the costs

associated with the rest of the analysis and the certainty with which

that  can  be  performed,  taking  due notice of the costs in time of

recovering from false starts.

	Signature  tables  can  be  used  to  perform  four essential

functions that are required in the automatic recognition  of  speech.

These functions are: (1) the elimination of superfluous and redundant

information from the acoustic input stream, (2) the transformation of

the  remaining  information  from  one  coordinate  system  to a more

phonetically  meaningful  coordinate  system,  (3)  the   mixing   of

acoustically  derived  data  with  syntactic, semantic and linguistic

information  to  obtain  the  desired  recognition,   and   (4)   the

introduction of a learning mechanism.



	The  following  three  advantages  emerge from this method of

training and evaluation.

	1)  Essentially  arbitrary  inter-relationships  between  the

input terms are taken into account by any one table. The only loss of

accuracy is in the quantization.

	2) The training is a  very  simple  process  of  accumulating

counts.  The  training samples are introduced sequentially, and hence

simultaneous storage of all the samples is not required.

	3)  The  process  linearizes  the storage requirements in the

parameter space.



	The  signature tables, as used in speech recognition, must be

particularized to allow for the multi-category nature of the  output.

Several  forms  of tables have been investigated.  An overview of the

current system is given in Appendix A. For some early results see SUR

Note  43  "Some  Preliminary  Experiments in Speech Recognition Using


Signature Tables" by R. B. Thosar and A. L. Samuel [20].



	Work  is  currently  under  way  on a major refinement of the

signature table  approach  which  adopts  a  somewhat  more  rigorous

procedure.  Preliminary  results  with  this  scheme  indicate that a

substantial improvement has been achieved. This effort  is  described

in a recent report, SUR Note 81, "Estimation of Probability Densities

Using Signature Tables for Application to Pattern Recognition," by

R. B. Thosar [21].



	We are currently involved in work on a segmentation procedure

which has already demonstrated its  ability  to  compete  with  other

proposed  segmentation systems, even when used to process speech from

speakers whose utterances were not used during the training sequence.

2.  FACILITIES



The  computer  facilities of  the  Stanford  Artificial  Intelligence

Laboratory include the following equipment.



Central Processors:  Digital Equipment Corporation PDP-10 and PDP-6



Primary Store:       65K words of 1.7 microsecond DEC Core

	             65K words of 1 microsecond Ampex Core

                     131K words of 1.6 microsecond Ampex Core



Swapping Store:      Librascope disk (5 million words, 22 million

                     bits/second transfer rate)



File Store:          IBM 3334 disc file, 6 spindles (leased)



Peripherals:         4 DECtape drives, 2 mag tape drives, line printer,

	             Calcomp plotter, Xerox Graphics Printer



Communications

    Processor:	     BBN IMP (Honeywell DDP-516) connected to the

		     ARPA network.



Terminals:           58 TV displays, 6 III displays, 3 IMLAC displays,

	             1 ARDS display, 15 Teletype terminals




Special  Equipment:  Audio  input  and  output  systems, hand-eye

                     equipment (2 TV cameras, 3 arms), remote-

                     controlled cart



Existing and planned facilities will  be  adequate  to  support  this

proposal, hence no additional facilities are budgeted.

3.  BUDGET
		
                 Two years beginning October 1, 1973


BUDGET CATEGORY					YEAR 1	YEAR 2
-----------------------------------------------------------------
I. SALARIES & WAGES:
	
	Samuel, A.L.,
	Senior Research Associate
	Principal Investigator, 75%		 20,000	 20,000

	------,
	Research Associate			 14,520	 14,520

	Miller, N.J.,
	Research Associate			 13,680	 13,680

	------,
	Student Research Assistant,
	50% academic year, 100% summer		  4,914	  5,070

	------,
	Student Research Assistant,
	50% academic year, 100% summer		  4,914	  5,070

	Reserve for Salary Increases
	@ 5.5% per year				  3,192	  6,592
						-------	-------

	TOTAL SALARIES AND WAGES		$61,220 $64,932

II. STAFF BENEFITS:

	17.0% 10-1-73 to 8-31-74		  9,540
	18.3% 9-1-74 to 8-31-75			    934  10,894
	19.3% 9-1-75 to 9-30-75				  1,042
						-------	-------
	TOTAL STAFF BENEFITS			$10,474 $11,936

III. TRAVEL:

	Domestic -
		Local		150
		East Coast	450
				---
						   $600    $600

IV.  EXPENDABLE MATERIALS & SERVICES:

	A. Telephone Service	480
	B. Office Supplies	600
				---
						 $1,080  $1,080

V.  PUBLICATIONS COST:

	2 Papers @ 500 ea.			 $1,000  $1,000
						------- -------

VI. TOTAL DIRECT COSTS:

	(Items I through V)			$74,374 $79,548

VII. INDIRECT COSTS:

	On Campus - 47% of NTDC			$34,956 $37,388

						-------	-------
VIII. TOTAL COSTS:

	(Items VI + VII)		       $109,330 $116,936          
					       -------- --------

APPENDIX A.  INITIAL FORM OF SIGNATURE TABLE FOR SPEECH RECOGNITION



	The  signature tables, as used in speech recognition, must be

particularized to allow for the multi-category nature of the  output.

Several  forms  of  tables  have  been investigated. The initial form

tested and used for the data presented in  the  attached  paper  uses

tables  consisting of two parts, a preamble and the table proper. The

preamble contains: (1) space for saving a record of the  current  and

recent  output reports from the table, (2) identifying information as

to the specific type of table, (3) a parameter  that  identifies  the

desired  output  from  the  table  and  that  is used in the learning

process, (4) a gating parameter specifying the input that is to be

used  to  gate  the  table,  (5) the sign of the gate, (6) the gating

level to be used, and (7) parameters that identify the sources of the

normal inputs to the table.



	All  inputs  are  limited  in  range  and  specify either the

absolute level of some basic property or more usually the probability

of some property being present. These inputs may be from the original

acoustic input or they may be the outputs of other  tables.  If  from

other  tables  they  may  be for the current time step or for earlier

time steps (subject to practical limits as to the number of time

steps that are saved).

	The output, or outputs, from each table are similarly limited

in  range  and  specify,  in  all  cases,  a  probability  that  some

particular significant feature, phonette, phoneme, word segment, word

or phrase is present.



	We are limiting the range of inputs  and  outputs  to  values

specified  by  3  bits  and  the  number  of  entries per table to 64

although this choice of values  is  a  matter  to  be  determined  by

experiment.  We  are  also  providing  for any of the following input

combinations: (1) one input of 6 bits, (2) two inputs of 3 bits each,

(3)  three  inputs  of 2 bits each, and (4) six inputs of 1 bit each.

The uses to which these different forms are  put  will  be  described

later.
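
	All four input forms address the same 64-entry table; each
combination of input values selects one entry line.  A minimal sketch
of this addressing, assuming the quantized inputs are simply
concatenated into a 6-bit index (the proposal does not fix the exact
packing):

    def pack_index(inputs, bits_per_input):
        """Pack quantized inputs into a single 6-bit table index."""
        assert len(inputs) * bits_per_input == 6
        index = 0
        for value in inputs:
            assert 0 <= value < (1 << bits_per_input)
            index = (index << bits_per_input) | value
        return index  # 0..63, one table entry per combination

    # The four allowed input forms:
    # pack_index((45,), 6)                 one 6-bit input
    # pack_index((5, 2), 3)                two 3-bit inputs
    # pack_index((1, 3, 0), 2)             three 2-bit inputs
    # pack_index((1, 0, 1, 1, 0, 1), 1)    six 1-bit inputs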



	The  body  of  each  table  contains entries corresponding to

every possible combination of  the  allowed  input  parameters.  Each

entry  in  the  table  actually  consists of several parts. There are

fields assigned to accumulate counts of the occurrences of  incidents

in  which  the  specifying  input values coincided with the different

desired outputs from the table  as  found  during  previous  learning

sessions  and  there  are fields containing the summarized results of

these learning sessions, which are used as outputs  from  the  table.

The  outputs from the tables can then express to the allowed accuracy

all possible functions of the input parameters.

Operation in the Training Mode



	When operating in  the training mode the program  is supplied

with  a  sequence  of stored  utterances  with  accompanying phonetic

transcriptions.  Each  sample  of  the  incoming  speech   signal  is

analyzed (Fourier transforms or  inverse filter equivalent) to obtain

the  necessary input  parameters for the  lowest level  tables in the

signature table hierarchy.  At  the same time reference is made  to a

table of phonetic "hints" which prescribes the desired outputs from

each table, corresponding to all possible phonemic inputs.  The

signature tables are then processed.



	The processing of each  table  is  done  in  two  steps,  one

process  at each entry to the table and the second only periodically.

The first process consists of locating a single entry line within the

table  as  specified by the inputs to the table and adding a 1 to the

appropriate field to indicate the presence of the property  specified

by the hint table as corresponding to the phoneme specified in the

phonemic transcription. At this time a report is also made as to  the

table's  output  as  determined from the averaged results of previous

learning so that a running record may be kept of the  performance  of

the   system.  At  periodic  intervals  all  tables  are  updated  to

incorporate recent learning results.  To  make  this  process  easily

understandable,  let  us  restrict  our  attention to a table used to

identify a single significant feature, say voicing.  The  hint  table

will identify whether or not the phoneme currently being processed is


to be considered voiced. If it is voiced, a 1 is added to  the  "yes"

field of the entry line located by the normal inputs to the table. If

it is not voiced, a 1 is added to the "no" field.  At  updating  time

the  output that this entry will subsequently report is determined by

dividing the accumulated sum in the "yes" field by  the  sum  of  the

numbers in the "yes" and the "no" fields, and reporting this quantity

as a number in the range from 0 to 7. Actually the process is  a  bit

more complicated than this and it varies with the exact type of table

under consideration, as reported in detail  elsewhere.  Outputs  from

the  signature tables are not probabilities, in the strict sense, but

are the statistically-arrived-at odds based on  the  actual  learning

sequence.
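
	The arithmetic of the voicing example can be sketched as
follows.  This is a minimal illustration, in present-day Python, of
the count accumulation and periodic updating described above; it
ignores the complications that vary with table type, and the handling
of an untrained entry is a guess.

    class VoicingEntry:
        """One entry line of a signature table (voicing example)."""

        def __init__(self):
            self.yes = 0  # occurrences hinted as voiced
            self.no = 0   # occurrences hinted as not voiced

        def train(self, voiced):
            """Add a 1 to the field selected by the hint."""
            if voiced:
                self.yes += 1
            else:
                self.no += 1

        def update(self):
            """Summarize learning as the 0..7 output reported later."""
            total = self.yes + self.no
            if total == 0:
                return 0  # untrained entry: behavior assumed here
            return round(7 * self.yes / total)

For example, an entry that has accumulated 30 "yes" counts and 10
"no" counts would subsequently report round(7 * 30/40) = 5.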



	The preamble of the  table has space for storing  twelve past

outputs.  An input to a  table can be  delayed to that  extent.  This

table relates outcomes of previous events with the preset hint (the

learning input).  A certain amount of context-dependent learning is

thus  possible  with the  limitation  that the  specified  delays are

constant.



	The interconnected hierarchy of tables forms a network which

runs  incrementally, in steps  synchronous with the  time window over

which the input signal is analyzed.  The present window width  is set

at 12.8 ms (256 points at 20 K samples/sec) with an overlap of 6.4 ms.

Inputs  to  this  network  are  the  parameters  abstracted from  the

frequency analyses of the signal, and the specified hint.  The


outputs of  the network could  be either the  probability attached to

every phonetic symbol  or the  output of  a table  associated with  a

feature such as voiced, vowel, etc.  The point to be made is that the

output  generated  for a  sample  is essentially  independent  of its

contiguous samples. The  dependency achieved by  using delays in  the

inputs is invisible to the outputs.  The outputs thus report the best

estimate  on what the current  acoustic input is  with no relation to

the past  outputs. Relating  the  successive outputs  along the  time

dimension is realized by counters.



The Use of COUNTERS



	The transition from initial sample space  to segment space is

made   possible  by   means  of   COUNTERS   which  are   summed  and

reinitialized  whenever  their   inputs  cross  specified   threshold

values,  being triggered on when  the input exceeds the threshold and

off  when  it  falls  below.    Momentary  spikes  are eliminated  by

specifying time  hysteresis, the  number of  consecutive samples  for

which  the input  must  be above  the  threshold.   The  output of  a

counter  provides  information  about  starting  time,  duration  and

average input for the period it was active.
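
	A counter might be realized along the following lines.  This
sketch (again in present-day Python, with invented parameter values)
applies the time-hysteresis rule by discarding active periods shorter
than the specified number of consecutive samples, and it reports the
starting time, duration, and average input of each surviving period.

    def run_counter(values, threshold, hysteresis):
        """Trigger on values >= threshold; filter momentary spikes.

        values     -- one table output (0..7) per time step
        threshold  -- trigger level
        hysteresis -- minimum number of consecutive above-threshold
                      samples for an active period to count
        Yields (start_time, duration, average_input) per period.
        """
        run, start = [], None
        for t, v in enumerate(values):
            if v >= threshold:
                if start is None:
                    start = t        # counter triggered on
                run.append(v)
            else:
                if start is not None and len(run) >= hysteresis:
                    yield (start, len(run), sum(run) / len(run))
                run, start = [], None  # counter triggered off
        if start is not None and len(run) >= hysteresis:
            yield (start, len(run), sum(run) / len(run))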



	Since a counter can reference a table at  any  level  in  the

hierarchy of tables, it can reflect any desired degree of information

reduction. For example, a counter may be set up to show a section  of

speech  to  be  a vowel, a front vowel or the vowel /I/. The counters


can be looked upon as representing a mapping of parameter-time space

into a feature-time space or, at a higher level, a symbol-time space. It

may be useful to carry along the feature information as a backup in

those  situations where the symbolic information is not acceptable to

syntactic or semantic interpretation.



	In  the  same   manner  as  the  tables,  the   counters  run

completely independently of each other.  In a recognition run the

counters may overlap in arbitrary  fashion, may leave out gaps  where

no counter has  been triggered or may not line  up nicely. A properly

segmented   output,  where  the  consecutive  sections  are  in  time

sequence and are neatly labeled, is essential for further

processing.  This is achieved by registering the instants when the

counters are  triggered  or terminated  to  form time  slices  called

segments.



	An  event is  the  period  between successive  activation  or

termination  of any counter. An  event shorter than  a specified time

is merely  ignored.   A record  of event  durations and  up to  three

active  counters,   ordered  according   to  their   probability,  is

maintained.



	An event resulting from the processing described so far

represents a phonette - one of the basic speech categories defined as

hints in the learning process. It is only an estimate of closeness to

a speech category, based on past learning. Also, each category has a


more-or-less stationary spectral characterization.  Thus  a  category

may have a phonemic equivalent, as in the case of vowels; it may be

common to a phoneme class, as for the voiced or unvoiced stop gaps; or it

may be subphonemic, as in a T-burst or a K-burst. The choices are based

on acoustic expediency, i.e. optimization of the learning rather than

any linguistic considerations.  However, higher-level interpretive

programs may best operate on inputs resembling phonemic transcription.

The contiguous segments may be coalesced into phoneme-like units

using dyadic or triadic probabilities and acoustic-phonetic rules

particular  to  the system. For example, a period of silence followed

by a type of burst or a short friction may be combined  to  form  the

corresponding  stop. A short friction or a burst following a nasal or

a lateral may be called a stop even if the silence period is short or

absent.  Clearly these rules must be specific to the system, based on

the confidence with  which  durations  and  phonette  categories  are

recognized.
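
	The two rules just quoted might be coded along these lines.
The segment labels and the duration bound are invented for the
illustration; as the text notes, the actual rules would have to be
tuned to the system.

    def coalesce(segments, short_ms=50):
        """Coalesce phonette segments into phoneme-like units.

        segments -- list of (label, duration_ms) pairs, e.g.
                    ('silence', 80), ('burst', 15), ('friction', 30)
        short_ms -- hypothetical bound on a "short" friction
        """
        result, i = [], 0
        while i < len(segments):
            label, dur = segments[i]
            nxt = segments[i + 1] if i + 1 < len(segments) else None
            # Rule 1: silence followed by a burst or short friction
            # combines into the corresponding stop.
            if label == 'silence' and nxt is not None and (
                    nxt[0] == 'burst'
                    or (nxt[0] == 'friction' and nxt[1] < short_ms)):
                result.append(('stop', dur + nxt[1]))
                i += 2
                continue
            # Rule 2: a burst or short friction after a nasal or
            # lateral is called a stop even without the silence.
            prev = result[-1][0] if result else None
            if prev in ('nasal', 'lateral') and (
                    label == 'burst'
                    or (label == 'friction' and dur < short_ms)):
                result.append(('stop', dur))
                i += 1
                continue
            result.append((label, dur))
            i += 1
        return result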

B.  SPEECH RESEARCH AT STANFORD UNIVERSITY



	Efforts  to  establish  a vocal  communication  link  with  a

digital computer have been under way at Stanford since 1963. These

efforts have been  primarily concerned with  four areas of  research.

First,    basic  research   in  extracting  phonemic  and  linguistic

information from speech waveforms has been pursued.  Second, the

application of automatic learning processes has been investigated.

Third,  the use  of syntax  and semantics  to aid  speech recognition

has been explored.  Finally, the application of speech recognition

systems  to  control  other processes  developed  at  the  Artificial

Intelligence Facility has been carried out.  These efforts have

been carried  out in  parallel with  varying  emphasis on  particular

factors at different times.  None  of the facets of this research has

been  solved completely.  However, each  limited success has provided

insight and direction which opened a wealth of challenging,

state-of-the-art research projects.



	The fruits  of Stanford's speech research  program were first

seen in October  1964 when  Raj Reddy published  a report  describing

his preliminary  investigations on the  analysis of  speech waveforms

[1].  This report described  the initial  digital processes developed

for  analyzing  waveforms  of  vowels  and  consonants,   fundamental

frequency,   and formants.   These processes  were used as  the basis

for a simple vowel recognition system and synthesis of sounds.




	By 1966 Reddy had built a much larger system which obtained a

phonemic transcription and  which achieved segmentation of  connected

phrases utilizing hypothesis testing [2].  This system represented a

significant contribution towards speech sound segmentation [3].  This

system operated  on a subset  of the speech  of a  single cooperative

speaker.



	By 1967 Reddy and his students had refined several of his

processes  and  published  papers  on  phoneme  grouping  for  speech

recognition [4],   pitch period determination  of speech sounds  [5],

and computer recognition of connected  speech [6]. At this time Reddy

was  considering the introduction  of learning into  his processes at

several stages.   He  was  also supervising  several related  student

projects including limited  vocabulary speech recognition,  a phoneme

string to word  string transcription  program,   a syllable  junction

program,  and telephone speech recognition.



	1968 was  an extremely  productive year  for Professor  Reddy

and  his  speech   group.    Pierre  Vicens  published  a  report  on

preprocessing for  speech analysis [7];  Reddy published  a paper  on

the computer  transcription of  phonemic symbols  [8]; Reddy  and Ann

Robinson  published  a  paper on  phoneme-to-grapheme  translation of

English [9];  Reddy and Vicens  published a  paper on procedures  for

segmentation of  connected speech [10];  and Reddy presented  a paper

in Japan on consonantal  clustering and connected speech  recognition

[11].   In addition to  this basic speech  research, a paper  by John


McCarthy, Lester Earnest, Raj Reddy, and Pierre Vicens was

presented at  the 1968  Fall Joint  Computer  Conference entitled  "A

Computer With Hands, Eyes, and Ears" which, in part, described

the vocal control of the artificial arm developed at Stanford [12].



	By 1969 the Stanford-developed speech processes were

successfully segmenting  and  parsing  continuous utterances  from  a

restricted  syntax.  Pierre  Vicens produced  a report on  aspects of

speech recognition by computer which investigated the techniques  and

methodologies  which  are  useful in  achieving  close  to  real-time

recognition  of  speech [13].   In  March  of 1969,  Raj  Reddy, Dave

Espar, and  Art  Eisenson produced  a  16mm  color movie  with  sound

entitled "Hear  Here".  This film  described the state of  the speech

recognition  project  as of  Spring, 1969.    In addition,  Raj Reddy

completed  a report  on  the  use of  environmental,  syntactic,  and

probabilistic  constraints in vision  and speech  [14] and  Reddy and

R.B.  Neely  reported their  research on the  contextual analysis  of

phonemes of English [15].



	In 1970, a paper was presented by Raj Reddy, L. D. Erman,

and  R.  B.   Neely concerning the speech  recognition project at the

IEEE  Systems  Science  and  Cybernetics  Conference.  At  this  time

Professor Reddy left Stanford to  join the faculty of Carnegie-Mellon

University  and Dr.   Arthur Samuel  became the head  of the Stanford

speech  research  efforts.   Dr.  Samuel  was  the  developer  of  an

extremely  successful machine  learning scheme  which  had previously


been applied to the game of checkers [16],[17].  He resolved to

apply these techniques to speech recognition.



	By  1971  the first report on  a  speech  recognition  system

utilizing Samuel's learning  scheme was written by George White [18].

This report  was  primarily concerned  with  the examination  of  the

properties of signature trees and the heuristics involved in their

application to finding an optimal minimal set of features to achieve

recognition.  Also at this time, M. M. Astrahan produced a report

describing  his research  on speech  analysis by  clustering,  or the

hyperphoneme method  [19].    This process  attempted  to  do  speech

recognition   by   mathematical  classifications   instead   of   the

traditional   phonemes   or   linguistic   categories.     This   was

accomplished  by  nearest-neighbor  classification  in  a  hyperspace

wherein cluster centers, or hyperphonemes, had been established.



	In 1972 R. B. Thosar and A. L. Samuel presented a report

concerning  some preliminary experiments  in speech recognition using

signature tables [20].   This approach  represented a general  attack

on speech recognition employing  learning mechanisms at each stage of

classification.



	The speech  effort in  1973 has  been devoted  to two  areas.

First,  a mathematically rigorous examination  and improvement of the

signature table learning mechanism has been accomplished by R. B.

Thosar.  Second, a segmentation  scheme based on signature tables  is


being  developed  to  provide  accurate  segmentation  together  with

probabilities  or  confidence  values  for  the  most  likely phoneme

occurring during each segment.  This process attempts to extract as

much  information about an  acoustic signal  as possible and  to pass

this information to higher level processes.  The preliminary  results

of  this  segmentation  scheme  will  be   presented  at  the  speech

segmentation   workshop  to  be  held   in  July  at  Carnegie-Mellon

University. In addition to these activities, a new, high-speed pitch

detection scheme  has been  developed by  J. A.  Moorer and  has been

submitted for publication [22].

C.  BIBLIOGRAPHY

1.  D. Raj Reddy, "Experiments on Automatic Speech Recognition  by  a
Digital Computer", AIM-26, October 1964, 19 pages.

2.  D. Raj Reddy, "An Approach to Computer Speech Recognition by
Direct Analysis of the Speech Waveform", AIM-43, September 1966, 144
pages.

3.   D.  Raj  Reddy, "Segmentation of Speech Sounds," J. Acoust. Soc.
Amer., August 1966.

4.  D. Raj Reddy,  "Phoneme  Grouping  for  Speech  Recognition,"  J.
Acoust. Soc. Amer., May, 1967.

5.   D.  Raj  Reddy,  "Pitch  Period Determination of Speech Sounds,"
Comm. ACM, June, 1967.

6.  D. Raj Reddy, "Computer  Recognition  of  Connected  Speech,"  J.
Acoust. Soc. Amer., August, 1967.

7.   Pierre  Vicens,  "Preprocessing  for  Speech  Analysis", AIM-71,
October 1968, 33 pages.

8.  D. Raj Reddy, "Computer Transcription of  Phonemic  Symbols",  J.
Acoust. Soc. Amer., August 1968.

9.   D. Raj Reddy, and Ann Robinson, "Phoneme-To-Grapheme Translation
of English", IEEE Trans. Audio and Electroacoustics, June 1968.

10.  D. Raj Reddy, and P. Vicens,  "Procedures  for  Segmentation  of
Connected Speech," J. Audio Eng. Soc., October 1968.

11.   D.  Raj  Reddy,  "Consonantal  Clustering  and Connected Speech
Recognition", Proc. Sixth International Congress of Acoustics, Vol. 2,
pp.  C-57 to C-60, Tokyo, 1968.

12.  John McCarthy, Lester Earnest, D. Raj Reddy, and Pierre Vicens,
"A Computer With Hands, Eyes, and Ears", Proceedings of the Fall
Joint Computer Conference, 1968.

13.  Pierre Vicens, "Aspects of Speech Recognition by Computer",
AIM-85, April 1969, 210 pages.

14.   D.  Raj  Reddy,  "On  the  Use  of Environmental, Syntactic and
Probabilistic Constraints in  Vision  and  Speech",  AIM-78,  January
1969, 23 pages.

15.   D.  Raj Reddy and R. B. Neely, "Contextual Analysis of Phonemes
of English", AIM-79, January 1969, 71 pages.

16.   A.  L. Samuel, "Some Studies in Machine Learning Using the Game
of Checkers," IBM Journal 3, 211-229 (1959).


17.  A. L. Samuel, "Some Studies in Machine Learning Using the Game
of Checkers, II - Recent Progress," IBM Jour. of Res. and Dev., 11,
pp. 601-617 (1967).

18.  George M.  White,  "Machine Learning  Through  Signature  Trees.
Applications to Human Speech", AIM-136, October 1970, 40 pages.

19.   M.  M.  Astrahan, "Speech Analysis by Clustering, or the Hyper-
phoneme Method", AIM-124, June 1970, 22 pages.

20.  R. B. Thosar and A. L. Samuel, "Some Preliminary  Experiments in
Speech  Recognition  Using  Signature  Table  Learning",  ARPA Speech
Understanding Research Group Note 43.

21.  R. B. Thosar, "Estimation of Probability Densities Using
Signature Tables for Application to Pattern Recognition", ARPA Speech
Understanding Research Group Note 81.

22.  J. A. Moorer, "The Optimum-Comb Method of Pitch Period Analysis
in Speech", AIM-207, July 1973.

D.  COGNIZANT PERSONNEL


        For contractual matters:

		Sponsored Projects Office
                Stanford University
                Stanford, California 94305

                Telephone: (415) 321-2300, ext. 2883

        For technical and scientific matters regarding this proposal:

                Arthur L. Samuel
                Computer Science Department
                Stanford University
                Stanford, California 94305

                Telephone: (415) 321-2300, ext. 3330

        For administrative matters, including questions relating
        to the budget or property acquisition:

                Mr. Lester D. Earnest
                Computer Science Department
                Stanford University
                Stanford, California 94305

                Telephone: (415) 321-2300, ext. 4202
